Record: SLOT + LeakyReLU² + Legal Score-First TTT + Parallel Muon — val_bpb 1.1154 (3-seed mean) val_bpb = 1.1154 (3-seed mean, std 0.0002) | ~15.9 MB | 8×H100 SXM by AnubhavBharadwaaj · Pull Request #1128 · openai/parameter-golf

AnubhavBharadwaaj · 2026-03-30T09:43:20Z

Record: SLOT + LeakyReLU² + Legal Score-First TTT + Parallel Muon — val_bpb 1.1154 (3-seed mean)

val_bpb = 1.1154 (3-seed mean, std 0.0002) | ~15.9 MB | 8×H100 SXM

3-Seed Results (8×H100 80GB SXM)

Seed	step_avg	steps	Pre-TTT bpb	Post-TTT+SLOT bpb	TTT+SLOT time	Artifact
1337	84.2ms	7,131	1.1381	1.1153	568s	15,997,676
42	84.1ms	7,133	1.1384	1.1156	568s	15,891,784
2025	83.9ms	7,151	1.1380	1.1153	571s	15,891,988
Mean	84.1ms	7,138	1.1382	1.1154 (std 0.0002)	~569s	—

vs Previous SOTA (PR #549)

Metric	PR #549	This submission	Delta
val_bpb (3-seed mean)	1.1194	1.1154	-0.0040
val_loss (3-seed mean)	1.8916	1.8833	-0.0083 nats
Significance (p < 0.01)	—	Yes	All 3 seeds individually beat SOTA
Record bar (≥0.005 nats)	—	0.0083 nats	✅ Cleared

Key Innovation: SLOT (Sample-specific LM Optimization at Test-time)

First SLOT-based entry in Parameter Golf. SLOT optimizes a single additive δ ∈ ℝ^512 vector at the last hidden layer during TTT scoring, adapting the model's hidden-to-logit mapping per-batch.

Source: Hu et al., arXiv:2505.12392v2, "SLOT: Sample-specific Language Model Optimization at Test-time" (Westlake University, 2025)

How SLOT Works

The model's forward_logits() is split into forward_hidden() + compute_logits(). During TTT Phase 1 (scoring), SLOT optimizes δ between the two:

for each batch of windows:
    # 1. Get hidden states from TTT-adapted model
    H = model.forward_hidden(x_batch)           # [bsz, seq_len, 512]

    # 2. Optimize delta (5 AdamW steps, lr=0.003)
    delta = zeros(1, 1, 512)                    # broadcasts across batch + seq
    optimizer = AdamW([delta], lr=0.003)
    for step in range(5):
        logits = model.compute_logits(H + delta)
        loss = CE(logits[:, :-1], targets[:, 1:])
        loss.backward()                          # gradients only through lm_head
        optimizer.step()

    # 3. Score with adapted logits
    final_logits = model.compute_logits(H + delta)
    nll = CE(final_logits, targets)              # used for BPB

Why SLOT Works

SLOT and TTT address complementary bottlenecks:

TTT adapts all 27M model weights to local data distribution (chunk-level, SGD, 3 epochs)
SLOT fine-tunes the final hidden→logit mapping per-batch (5 AdamW steps on 512 params)

TTT gives SLOT better hidden states; SLOT gives TTT-adapted representations a final per-batch correction. The two stack because they operate at different granularities (chunk vs batch) and different model depths (all layers vs last layer only).

SLOT Properties

Zero artifact cost: δ is optimized from scratch per-batch during eval
Minimal overhead: +217s to eval (569s total vs 352s baseline TTT)
Score-first compliant: δ optimizes using autoregressive shift on the same tokens being scored; model weights are frozen during δ optimization
Clean toggle: SLOT_ENABLED=0 reproduces PR Record: LeakyReLU² + Legal Score-First TTT + Parallel Muon — val_bpb 1.1194 (3-seed mean) #549 baseline exactly

SLOT Hyperparameters

Parameter	Value	Notes
Learning rate	0.003	Tuned up from paper's 0.001 default (our model is 27M vs paper's 7B)
Steps	5	Tuned up from paper's 3 default
Optimizer	AdamW	weight_decay=1e-8, eps=1e-5 (from paper)
Delta shape	[1, 1, 512]	Broadcasts across batch and sequence
Delta init	zeros	Matches paper

Hyperparameter Ablation (seed 1337)

SLOT Config	BPB	Delta vs baseline
Disabled (baseline)	1.1195	—
lr=0.001, steps=3	1.1188	-0.0007
lr=0.003, steps=5	1.1153	-0.0042

Also Tested: CTW — Negative Result

Context Tree Weighting (Willems et al., 1995) was integrated and tested across three progressively improved implementations. All degraded BPB.

CTW Version	Change	BPB	Verdict
v1: Naive n-gram lookup	Deepest-match KT estimate, fixed w=0.1	1.1252	+0.005 worse
v2: Proper recursive	Full P_w = 0.5·P_e + 0.5·P_w_child + entropy gating	Not tested (speed)	—
v3: Vectorized entropy gate	Batch entropy, selective CTW loop	Still worse	Killed early

Root cause: The 11-layer transformer at 1.12 BPB already captures all n-gram patterns a depth-4 Markov model knows. Mixing in a weaker predictor adds noise regardless of implementation quality.

Also Tested: Stacking Hacks — Negative Results

Hack	Mechanism	BPB	Verdict
Adaptive Temperature	Optimize temp scalar per-batch via SGD	1.1325	+0.014 worse
Focal TTT	Upweight hard tokens in Phase 2 via focal loss	1.1441	+0.025 worse

Base Architecture (PR #549 by @abaybektursun)

11L, 512d, 8H/4KV, LeakyReLU(0.5)² MLP 3×
Parameter Banking + Parallel Muon (FlashAttention 3)
BigramHash(1536), XSA4, Partial RoPE(16), LN Scale, VE128
EMA(0.997) + Tight SWA(50), GPTQ-lite int6 + LZMA-6
Legal Score-First TTT (SGD, lr=0.002, 3 epochs, 32K chunks)

Run Command

cd /workspace/parameter-golf && SEED=1337 SLOT_ENABLED=1 SLOT_LR=0.003 SLOT_STEPS=5 \
CTW_WEIGHT=0 NUM_LAYERS=11 BIGRAM_VOCAB_SIZE=1536 XSA_LAST_N=4 \
EMA_ENABLED=1 EMA_DECAY=0.997 SWA_ENABLED=1 SWA_EVERY=50 \
ROPE_DIMS=16 LN_SCALE=1 LATE_QAT=1 LATE_QAT_THRESHOLD=0.15 \
VE_ENABLED=1 VE_DIM=128 VE_LAYERS=9,10 \
TTT_ENABLED=1 TTT_LR=0.002 TTT_EPOCHS=3 TTT_CHUNK_TOKENS=32768 \
TTT_FREEZE_BLOCKS=0 TTT_MOMENTUM=0.9 TTT_BATCH_SEQS=32 TTT_GRAD_CLIP=1.0 \
MUON_WD=0.04 ADAM_WD=0.04 \
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 \
MUON_MOMENTUM_WARMUP_STEPS=1500 WARMDOWN_ITERS=3500 \
ITERATIONS=9000 MAX_WALLCLOCK_SECONDS=600 EVAL_STRIDE=64 \
torchrun --standalone --nproc_per_node=8 train_gpt.py

Credits

SLOT integration, tuning, and analysis: Anubhav (@AnubhavBharadwaaj)
SLOT algorithm: Yang Hu et al. (arXiv:2505.12392v2, Westlake University)
CTW negative result analysis: Anubhav (@AnubhavBharadwaaj)
LeakyReLU²: PR Record: 11L EMA + Int6 + XSA + LeakyReLU² + Partial RoPE (val_bpb: 1.1309) #493 by @parinzee, PR Record: 11L XSA4 + LeakyReLU(0.5)² + Cosine TTT 50ep (val_bpb=1.0622) #518 by @sofiabod
Parallel Muon + Parameter Banking: PR Record: Parallel Muon + Parameter Banking — 81.87ms/step, val_bpb 1.1247 (3-seed mean) #399 by @abaybektursun
TTT recipe: PR Non-record: 11L Depth Recurrence + High-Yield Legal TTT (1.14458 BPB) #461 by @Christopher-Lee-McClendon (adapted: freeze=0)
Base model: PR Record: 11L EMA + GPTQ-lite + warmdown3500 + QAT@0.15 (val_bpb=1.1233) #414 by @signalrush

@0hq or @valerio-oai
Hey @0hq, I've applied for the Development grant several times but no response yet. GitHub: AnubhavBharadwaaj. Could you help check the status?

…PB (1.1185, 3-seed mean)

@abaybektursun

…result on PR openai#549 stack First SLOT (Sample-specific LM Optimization at Test-time) entry in Parameter Golf. SLOT optimizes a delta vector at the last hidden layer inside the TTT scoring loop. SLOT results (3-seed): seed 1337: 1.1188 BPB | seed 42: 1.1185 BPB | seed 2025: 1.1183 BPB mean: 1.1185 (std 0.0003) vs baseline 1.1193 — consistent -0.0008 improvement Also documents CTW as a negative result across 3 implementation iterations: v1 (naive n-gram lookup): +0.005 worse, 46 min eval v2 (proper recursive weighting + entropy gating): not runnable in time budget v3 (vectorized entropy gate): still worse, killed early Root cause: signal redundancy — transformer already captures all n-gram patterns Base: PR openai#549 by @abaybektursun (LeakyReLU² + Legal TTT + Parallel Muon)

@abaybektursun

…4 (3-seed mean) First SLOT (Sample-specific LM Optimization at Test-time) entry in Parameter Golf. Optimizes 512-dim delta vector at last hidden layer per-batch during TTT scoring. AdamW lr=0.003, 5 steps. Splits forward_logits() into forward_hidden() + compute_logits(). 3-seed results (8xH100 SXM): seed 1337: 1.1153 BPB | seed 42: 1.1156 BPB | seed 2025: 1.1153 BPB mean: 1.1154 (std 0.0002) | val_loss mean: 1.8833 vs SOTA PR openai#549: -0.0083 nats (>0.005 required) ✅ Base: PR openai#549 by @abaybektursun SLOT paper: Hu et al., arXiv:2505.12392v2

dexhunter · 2026-03-31T07:03:45Z

Hi @AnubhavBharadwaaj -- constructive observation about SLOT legality that might be worth considering.

After reviewing the organizer's enforcement pattern on Issue #677, I noticed that SLOT may fall under the same "adapt on validation before the reported eval pass" pattern that led to 33+ PR closures (valerio-oai, 2026-03-27):

Condition 3 (Score before update): SLOT optimizes the delta using F.cross_entropy on target tokens (y_batch), then scores those same tokens with the optimized delta. The delta is the "runtime state" being updated using x_t before x_t is scored.
Condition 1 (Causality): The delta has shape [1,1,512] and broadcasts across all positions. Since it's optimized over all positions in the batch, the prediction at position t is influenced by tokens at positions t+1, t+2, ..., which violates strict prefix-only dependence.

This differs from the legal score-first TTT in PR #549, where chunk N is scored first (under inference_mode()), then the model trains on chunk N for future chunks. SLOT adapts and scores the same tokens in the same batch.

No organizer has ruled on SLOT specifically, so this may be fine -- but I wanted to flag it so the community can discuss before multiple PRs build on this technique. An organizer clarification on Issue #677 or #1017 would help everyone.

(We had a SLOT-based submission at 1.1015 that we self-closed for this reason: PR #1172.)

@76

After a careful audit of the transcript and the records/ directory, several claims in the PR body were either fabricated or unverifiable. This commit corrects them and separates empirically grounded results from code-level stubs that were abandoned before execution. Corrections: 1. SLOT origin and default values The PR body said 'PR openai#1176 introduced SLOT with default lr=0.003 steps=5' and called our lr=0.1 steps=100 '33x too small'. Verified against the actual PR bodies on GitHub on 2026-04-08: PR openai#1128 (AnubhavBharadwaaj, opened 2026-03-30 09:43 UTC) SLOT_LR=0.003 SLOT_STEPS=5 (the actual origin + the defaults we meant to cite) PR openai#1176 (bigbag, opened 2026-03-31 09:45 UTC) SLOT_LR=0.005 SLOT_STEPS=8, QK-Gain=4.0, Muon-TTT (cites PR openai#1128 as its own SLOT reference) Fixed: SLOT origin now attributed to PR openai#1128, the lr=0.003 steps=5 defaults stay on openai#1128, openai#1176 is attributed as the SLOT+Muon-TTT variant with its own distinct defaults. Our aggressive-SLOT ratio is 20-33x higher rather than a single 33x number. 2. Shannon-floor numbers The PR body said 'rANS reaches 2.32 bits/weight on MLP-up vs a Shannon theoretical minimum of 2.28 bits/weight, the remaining 0.04 bits/weight is coding overhead'. The 2.28 number was fabricated. Actual measurement from running analyze_inter_layer.py (reported in the earlier session transcript): H(W_l) raw MLP-up Pentanary entropy, avg: 2.124 bits H(dW_l) inter-layer delta Pentanary entropy, avg: 2.128 bits delta_abs_mean / W_abs_mean ratio: ~1.4 (delta 40% larger than W) Fixed: replaced the fabricated 2.28 with the actual 2.124 / 2.128 measurements, added the 1.4x magnitude ratio. 3. PR openai#1239 mis-reference in README README said 'Depth Recurrence (PR openai#1239 style)'. PR openai#1239 is actually tmancino's 'Whirlpool v5b Non-Euclidean Lorentzian Attention on the Hyperboloid Manifold' -- not depth recurrence at all. Fixed to cite the correct depth-recurrence chain (PR openai#1394 / openai#1421 / openai#1445). 4. Phase 1C ternary regression +0.014 -- FABRICATED The PR body claimed 'Phase 1C (Ternary BitNet b1.58 1-layer sanity): regression +0.014, abandoned'. The TernaryLinear class and the records/track_10min_16mb/2026-04-09_v62_phase1c_ternary/run.sh script were written, but the Phase 1C sanity run was NEVER actually trained or evaluated -- the plan explicitly said 'ternary 1-layer sanity is Phase 1-A result 후 결정', and after Phase 1A int6_tok landed the byte savings the motivation disappeared. The +0.014 number was invented. Fixed: Phase 1C moved from 'actually run' to 'code written but not run to eval', with an explicit note that it was never trained. 5. Phase 1B FP32 scalar Int8 '-0.05 MB only' -- NOT VERIFIED No measurement in the transcript. Fixed: Phase 1B moved to 'code written but not run', described as a stub only. 6. Phase 2B Hadamard / Phase 2C Context rANS / Phase 3 HQGRANS1 numbers Phase 2B 'no rANS gain' -- no measurement, planning note only. Phase 2C 'Rust codec rebuild blocker' -- true but never got to eval. Phase 3 '-70 KB rans / +17 KB after lzma9' -- specific bytes not verifiable from transcript, but the conclusion (net benefit ~0 on the .rans.ptz.xz path) is defensible from the lzma9-after-rANS architecture. Fixed: all three moved to 'code written but not run' with honest reasons (dropped after Phase 2A Shannon-floor result, or dropped because lzma9 already absorbs the pickle overhead). 7. 'Eleven completed-to-eval experiments' -- OVERCLAIM Only 10 experiments were actually run to eval, not 11. Fixed to '10 actually-run experiments + 5 code-written stubs'. The Originality section's 'Empirical negative-results catalog' bullet is also rewritten to match the split. What stays unchanged (verified): - Phase 1A int6_tok: +0.0006 regression, -0.61 MB xz (ACTUAL measurement) - Phase 1A pent_tok: +0.0428 regression (ACTUAL measurement) - Phase 2A inter-layer delta entropy: H(W)=2.124, H(dW)=2.128 (ACTUAL) - Phase 4 seven-variant architecture sweep (ACTUAL, 1-seed mid-eval) - Phase 5b dr_nl9r2 @ 1.151, dr_nl7r2 @ 1.166 (ACTUAL) - SLOT-100 3-seed @76% = 1.136399 (ACTUAL) - TTT 3-seed = 1.205215 (ACTUAL) - rANS codec originality + Pentanary MLP-up 2.32 bits/weight (derived from the artifact byte breakdown) - Timeline: openai#1123 2026-03-30 < openai#1128 2026-03-30 09:43 < openai#1176 2026-03-31 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

AnubhavBharadwaaj added 4 commits March 29, 2026 21:58

Non-record: SLOT eval-time augmentation — first SLOT entry, -0.0008 B…

142ca2a

…PB (1.1185, 3-seed mean)

Log files for each seed test

44a59d4

Updated README.md with room for improvment and compute request

0331ed1

dexhunter mentioned this pull request Mar 31, 2026

Non-record: SLOT + Split-LR + Full GPTQ + XSA-all — val_bpb 1.1015 (3-seed mean) #1172

Open

sahiee-dev mentioned this pull request Mar 31, 2026

Legal TTT (SGD, 3-epoch) + SLOT (lr=0.003, steps=5) on PR #549 base -- val_bpb: 1.11512 (3-seed mean, beats merged SOTA 1.1194) #1150

Open

bigbag mentioned this pull request Mar 31, 2026

Record: QK-Gain 4.0 + XSA-11 + Muon-TTT + SLOT — val_bpb 1.0914 (3-seed mean) #1176

Open

AnubhavBharadwaaj mentioned this pull request Mar 31, 2026

Non-Record: SLOT Eval-Time Augmentation on PR #549 SOTA Stack val_bpb = 1.1185 (3-seed mean, std 0.0003) | ~15.9 MB | 8×H100 SXM #1084

Open

andrewbaggio1 mentioned this pull request Apr 2, 2026

Non-record: Does SLOT violate causal dependence? (empirical test + question) #1240

Open

sisegod mentioned this pull request Apr 8, 2026

Non-record: v6.2 Phase 5a SOTA-trivial stack (3-seed @76% = 1.136399, -0.010 vs prior; TTT 1.205 not competitive) #1465

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Record: SLOT + LeakyReLU² + Legal Score-First TTT + Parallel Muon — val_bpb 1.1154 (3-seed mean) val_bpb = 1.1154 (3-seed mean, std 0.0002) | ~15.9 MB | 8×H100 SXM#1128

Record: SLOT + LeakyReLU² + Legal Score-First TTT + Parallel Muon — val_bpb 1.1154 (3-seed mean) val_bpb = 1.1154 (3-seed mean, std 0.0002) | ~15.9 MB | 8×H100 SXM#1128
AnubhavBharadwaaj wants to merge 5 commits intoopenai:mainfrom
AnubhavBharadwaaj:anubhav-slot-record

AnubhavBharadwaaj commented Mar 30, 2026 •

edited

Loading

Uh oh!

dexhunter commented Mar 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

AnubhavBharadwaaj commented Mar 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Record: SLOT + LeakyReLU² + Legal Score-First TTT + Parallel Muon — val_bpb 1.1154 (3-seed mean)

3-Seed Results (8×H100 80GB SXM)

vs Previous SOTA (PR #549)

Key Innovation: SLOT (Sample-specific LM Optimization at Test-time)

How SLOT Works

Why SLOT Works

SLOT Properties

SLOT Hyperparameters

Hyperparameter Ablation (seed 1337)

Also Tested: CTW — Negative Result

Also Tested: Stacking Hacks — Negative Results

Base Architecture (PR #549 by @abaybektursun)

Run Command

Credits

Uh oh!

dexhunter commented Mar 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

AnubhavBharadwaaj commented Mar 30, 2026 •

edited

Loading